Tree-Values: Selective Inference for Regression Trees

We consider conducting inference on the output of the Classification and Regression Tree (CART) (Breiman et al., 1984) algorithm. A naive approach to inference that does not account for the fact that the tree was estimated from the data will not achieve standard guarantees, such as Type 1 error rate control and nominal coverage. Thus, we propose a selective inference framework for conducting inference on a fitted CART tree. In a nutshell, we condition on the fact that the tree was estimated from the data. We propose a test for the difference in the mean response between a pair of terminal nodes that controls the selective Type 1 error rate, and a confidence interval for the mean response within a single terminal node that attains the nominal selective coverage. Efficient algorithms for computing the necessary conditioning sets are provided. We apply these methods in simulation and to a dataset involving the association between portion control interventions and caloric intake.


Introduction
Regression tree algorithms recursively partition covariate space using binary splits to obtain regions that are maximally homogeneous with respect to a continuous response. The Classification and Regression Tree (CART; Breiman et al. 1984) proposal, which involves growing a large tree and then pruning it back, is by far the most popular of these algorithms.
The regions defined by the splits in a fitted CART tree induce a piecewise constant regression model where the predicted response within each region is the mean of the observations in that region. CART is popular in large part because it is highly interpretable; someone without technical expertise can easily "read" the tree to make predictions, and to understand why a certain prediction is made. However, its interpretability belies the fact that CART trees are highly unstable: a small change to the training dataset can drastically change the structure of the fitted tree. In the absence of an established notion of statistical significance associated with a given split in the tree, it is hard for a practitioner to know whether they are interpreting signal or noise. In this paper, we use the framework of selective inference to fill this gap by providing a toolkit to conduct inference on hypotheses motivated by the output of the CART algorithm.
Given a CART tree, consider testing for a difference in the mean response of the regions resulting from a binary split. A very naive approach, such as a two-sample Z-test, that does not account for the fact that the regions were themselves estimated from the data will fail to control the selective Type 1 error rate: the probability of rejecting a true null hypothesis, given that we decided to test it [Fithian et al., 2014]. Similarly, a naive Z-interval for the mean response in a region will not attain nominal selective coverage: the probability that the interval covers the parameter, given that we chose to construct it.
In fact, approaches for conducting inference on the output of a regression tree are quite limited. Sample splitting involves fitting a CART tree using a subset of the observations, which will naturally lead to an inferior tree to the one resulting from all of the observations, and thus is unsatisfactory in many applied settings; see Athey and Imbens [2016]. Wager and Walther [2015] develop convergence guarantees for unpruned CART trees that can be leveraged to build confidence intervals for the mean response within a region; however, they do not provide finite-sample results and cannot accommodate pruning. Loh et al. [2016] and Loh et al. [2019] develop bootstrap calibration procedures that attempt to provide confidence intervals for the regions of a regression tree. In Appendix A, we show that this bootstrap calibration approach fails to provide intervals that achieve nominal coverage for the parameters of interest in this paper.
As an alternative to performing inference on a CART tree, one could turn to the conditional inference tree (CTree) framework of Hothorn et al. [2006]. This framework uses a different tree-growing algorithm than CART, and at each split tests for linear association between the split covariate and the response. As summarized in Loh [2014], the CTree framework alleviates issues with instability and variable selection bias associated with CART.
Despite these advantages, CTree remains far less widely-used than CART. Furthermore, while CTree attaches a notion of statistical significance to each split in a tree, it does not directly allow for inference on the mean response within a region or the difference in mean response between two regions. Finally, while the CTree framework requires few assumptions, its inference is based on asymptotics.
In this paper, we introduce a finite-sample selective inference [Fithian et al., 2014] framework for the difference between the mean responses in two regions, and for the mean response in a single region, in a pruned or unpruned CART tree. We condition on the event that CART yields a particular set of regions, and thereby achieve selective Type 1 error rate control as well as nominal selective coverage.
The rest of this paper is organized as follows. In Section 2, we review the CART algorithm, and briefly define some key ideas in selective inference. In Section 3, we present our proposal for selective inference on the regions estimated via CART. We show that the necessary conditioning sets can be efficiently computed in Section 4. In Section 5 we compare our framework to sample splitting and CTree via simulation. In Section 6 we compare our framework to CTree on data from the Box Lunch Study. The discussion is in Section 7.
Technical details are relegated to the supplementary materials.

Notation for Regression Trees
Given p covariates (X_1, . . . , X_p) measured on each of n observations (x_1, . . . , x_n), let x_{j,(s)} denote the sth order statistic of the jth covariate, and define the half-spaces

χ_{j,s,1} = {z ∈ R^p : z_j ≤ x_{j,(s)}},   χ_{j,s,0} = {z ∈ R^p : z_j > x_{j,(s)}}.   (1)

The following definitions are illustrated in Figure 1.
Definition 1 (Tree and Region). Consider a set S such that R ⊆ R^p for all R ∈ S. Then S is a tree if and only if (i) R^p ∈ S; (ii) every element of S \ {R^p} equals R ∩ χ_{j,s,e} for some R ∈ S, j ∈ {1, . . . , p}, s ∈ {1, . . . , n − 1}, e ∈ {0, 1}; (iii) R ∩ χ_{j,s,e} ∈ S implies that R ∩ χ_{j,s,1−e} ∈ S for e ∈ {0, 1}; and (iv) for any R, R′ ∈ S, R ∩ R′ ∈ {∅, R, R′}. If R ∈ S and S is a tree, then we refer to R as a region.
We use the notation tree to refer to a particular tree. Definition 1 implies that any region R ∈ tree \ {R^p} is of the form R = ∩_{l=1}^{L} χ_{j_l,s_l,e_l}, where for each l = 1, . . . , L, we have that j_l ∈ {1, . . . , p}, s_l ∈ {1, . . . , n − 1}, and e_l ∈ {0, 1}. We call L the level of the region, and use the convention that the level of R^p is 0.
Definition 3 (Descendant and Ancestor). If R, R′ ∈ tree and R ⊆ R′, then R is a descendant of R′, and R′ is an ancestor of R.
Definition 4 (Terminal Region). A region R ∈ tree without descendants is a terminal region.
We let desc(R, tree) denote the set of descendants of region R in tree, and we let term(R, tree) denote the subset of desc(R, tree) that are terminal regions.
Given a response vector y ∈ R^n, let ȳ_R = (Σ_{i: x_i ∈ R} y_i) / |{i : x_i ∈ R}| denote the mean response of the observations in region R, and let 1_{(A)} denote an indicator variable that equals 1 if the event A holds, and 0 otherwise. Then, a tree tree induces the regression model μ̂(x) = Σ_{R ∈ term(R^p, tree)} ȳ_R 1_{(x ∈ R)}. In other words, it predicts the response within each terminal region to be the mean of the observations in that region.

A Review of the CART Algorithm

The CART algorithm [Breiman et al., 1984] greedily searches for a tree that minimizes the sum of squared errors Σ_{R ∈ term(R^p, tree)} Σ_{i: x_i ∈ R} (y_i − ȳ_R)^2. It first grows a very large tree via recursive binary splits, starting with the full covariate space R^p. To split a region R, it selects the covariate x_j and the split point x_{j,(s)} to maximize the gain, defined as

gain_R(y, j, s) ≡ Σ_{i: x_i ∈ R} (y_i − ȳ_R)^2 − [ Σ_{i: x_i ∈ R ∩ χ_{j,s,1}} (y_i − ȳ_{R ∩ χ_{j,s,1}})^2 + Σ_{i: x_i ∈ R ∩ χ_{j,s,0}} (y_i − ȳ_{R ∩ χ_{j,s,0}})^2 ].   (2)

Details are provided in Algorithm A1.
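To make the greedy step concrete, the following is a minimal R sketch (not the rpart implementation) of the gain in (2): it scores a candidate split (j, s) of a region and then selects the gain-maximizing split. The function names are illustrative, and candidate cut points are taken to be the order statistics of each covariate, matching the notation above.

```r
# Sum of squared errors of a response vector around its mean.
sse <- function(v) if (length(v) == 0) 0 else sum((v - mean(v))^2)

# Gain in (2): reduction in SSE from splitting region R on covariate j at x_{j,(s)}.
gain <- function(y, X, in_R, j, s) {
  # in_R: logical vector marking the observations x_i that fall in region R
  cutpoint <- sort(X[, j])[s]              # the s-th order statistic x_{j,(s)}
  left  <- in_R & (X[, j] <= cutpoint)     # R intersected with chi_{j,s,1}
  right <- in_R & (X[, j] >  cutpoint)     # R intersected with chi_{j,s,0}
  sse(y[in_R]) - (sse(y[left]) + sse(y[right]))
}

# Greedy CART step: choose (j, s) maximizing the gain over all candidate splits.
best_split <- function(y, X, in_R) {
  n <- nrow(X); p <- ncol(X)
  cand <- expand.grid(j = seq_len(p), s = seq_len(n - 1))
  g <- mapply(function(j, s) gain(y, X, in_R, j, s), cand$j, cand$s)
  cand[which.max(g), ]
}

# Toy usage: split the full covariate space of a small simulated dataset.
set.seed(1)
X <- matrix(rnorm(100 * 3), 100, 3)
y <- rnorm(100) + (X[, 1] > 0)
best_split(y, X, in_R = rep(TRUE, 100))
```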
Once a very large tree has been grown, cost-complexity pruning is applied. We define the average per-region gain in sum of squared errors provided by the descendants of a region R,

g(R, tree, y) = [ Σ_{i: x_i ∈ R} (y_i − ȳ_R)^2 − Σ_{r ∈ term(R, tree)} Σ_{i: x_i ∈ r} (y_i − ȳ_r)^2 ] / ( |term(R, tree)| − 1 ).   (3)

Given a complexity parameter λ ≥ 0, if g(R, tree, y) < λ for some R ∈ tree, then cost-complexity pruning removes R's descendants from tree, turning R into a terminal region.
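The pruning rule in (3) amounts to a simple comparison, sketched below in R; the SSE values are toy numbers rather than output of any real fit.

```r
# Average per-region gain g in (3): given the SSE of region R and the SSEs of
# its terminal descendants, compute g and compare it to lambda.
avg_gain <- function(sse_R, sse_terminal_desc) {
  # sse_terminal_desc: vector of SSEs, one per terminal descendant of R
  (sse_R - sum(sse_terminal_desc)) / (length(sse_terminal_desc) - 1)
}

lambda <- 200
g <- avg_gain(sse_R = 950, sse_terminal_desc = c(300, 250, 150))  # toy numbers
if (g < lambda) message("prune: collapse R into a terminal region")
```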
Details are in Algorithm A2, which involves the notion of a bottom-up ordering.
Definition 5 (Bottom-up ordering). Let tree = {R_1, . . . , R_K}. Let π be a permutation of {1, . . . , K} such that every region appears only after all of its descendants; then R_{π(1)}, . . . , R_{π(K)} is a bottom-up ordering of tree.

There are other equivalent formulations for cost-complexity pruning (see Proposition 7.2 in Ripley [1996]); the formulation in Algorithm A2 is convenient for establishing the results in this paper.
To summarize, the CART algorithm first applies Algorithm A1 to the initial region R p and the data y to obtain an unpruned tree, which we call tree 0 (y). It then applies Algorithm A2 to tree 0 (y) to obtain an optimally-pruned tree using complexity parameter λ, which we call tree λ (y).
Algorithm A1. Grow(R, y): 1. If a stopping condition is met, return R.
Algorithm A2 (Cost-complexity pruning). Parameter O is a bottom-up ordering of the K regions in tree.
Prune(tree, y, λ, O): 1. Let tree_0 = tree. Let K be the number of regions in tree_0.

A Brief Overview of Selective Inference
Here, we provide a very brief overview of selective inference; see Fithian et al. [2014] or Taylor and Tibshirani [2015] for a more detailed treatment.
Consider conducting inference on a parameter θ. Classical approaches assume that we were already interested in conducting inference on θ before looking at our data. If, instead, our interest in θ was sparked by looking at our data, then inference must be performed with care: we must account for the fact that we "selected" θ based on the data [Fithian et al., 2014]. In this setting, interest focuses on a p-value p(Y) such that the test for H_0 : θ = θ_0 based on p(Y) controls the selective Type 1 error rate, in the sense that

pr_{H_0}{ p(Y) ≤ α | H_0 was selected for testing } ≤ α for all 0 ≤ α ≤ 1.   (4)

Also of interest are confidence intervals [L(Y), U(Y)] that achieve (1 − α)-selective coverage for the parameter θ, meaning that

pr{ θ ∈ [L(Y), U(Y)] | θ was selected for inference } ≥ 1 − α.   (5)

Roughly speaking, the inferential guarantees in (4) and (5) can be achieved by defining p-values and confidence intervals that condition on the aspect of the data that led to the selection of θ. In recent years, a number of papers have taken this approach to perform selective inference on parameters selected from the data in the regression [Lee et al., 2016, Liu et al., 2018, Tian and Taylor, 2018, Tibshirani et al., 2016], clustering [Gao et al., 2020], and changepoint detection [Hyun et al., 2021, Jewell et al., 2022] settings.
In the next section, we propose p-values that satisfy (4) and confidence intervals that satisfy (5) in the setting of CART, where the parameter of interest is either the mean response within a region, or the difference between the mean responses of two sibling regions.
3 The Selective Inference Framework for CART

Inference on a Pair of Sibling Regions
Throughout this paper, we assume that Y ∼ N n (µ, σ 2 I n ) with σ > 0 known.
We let X ∈ R^{n×p} denote a fixed covariate matrix. Suppose that we apply CART with complexity parameter λ to a realization y = (y_1, . . . , y_n)^T from Y to obtain tree_λ(y). Given sibling regions R_A and R_B in tree_λ(y), we define a contrast vector ν_sib ∈ R^n such that

(ν_sib)_i = 1_{(x_i ∈ R_A)} / |{i′ : x_{i′} ∈ R_A}| − 1_{(x_i ∈ R_B)} / |{i′ : x_{i′} ∈ R_B}|,   (6)

so that ν_sib^T y = ȳ_{R_A} − ȳ_{R_B}, and ν_sib^T µ is the difference between the mean of µ over the observations in R_A and the mean of µ over the observations in R_B. Now, consider testing the null hypothesis of no difference in means between R_A and R_B, i.e. H_0 : ν_sib^T µ = 0 versus H_1 : ν_sib^T µ ≠ 0. This null hypothesis is of interest because R_A and R_B appeared as siblings in tree_λ(y). A test based on a p-value of the form pr_{H_0}(|ν_sib^T Y| ≥ |ν_sib^T y|) that does not account for this will not control the selective Type 1 error rate in (4).
To control the selective Type 1 error rate, we propose a p-value that conditions on the aspect of the data that led us to select ν_sib^T µ, namely the event that R_A and R_B are siblings in tree_λ(Y):

pr_{H_0}{ |ν_sib^T Y| ≥ |ν_sib^T y| | R_A and R_B are siblings in tree_λ(Y) }.   (7)

But (7) depends on a nuisance parameter, the portion of µ that is orthogonal to ν_sib. To remove the dependence on this nuisance parameter, we condition on its sufficient statistic P^⊥_{ν_sib} Y, where P^⊥_ν ≡ I_n − νν^T / ‖ν‖_2^2. The resulting p-value, or "tree-value", is defined as

p_sib(y) ≡ pr_{H_0}{ |ν_sib^T Y| ≥ |ν_sib^T y| | R_A and R_B are siblings in tree_λ(Y), P^⊥_{ν_sib} Y = P^⊥_{ν_sib} y }.   (8)

Theorem 1. The test based on the p-value p_sib(y) in (8) controls the selective Type 1 error rate for H_0 : ν_sib^T µ = 0, where ν_sib is defined in (6), in the sense that

pr_{H_0}{ p_sib(Y) ≤ α | R_A and R_B are siblings in tree_λ(Y) } ≤ α for all 0 ≤ α ≤ 1.   (9)

Furthermore, for the perturbation y′(φ, ν) ≡ y + (φ − ν^T y) ν / ‖ν‖_2^2 and the set

S^λ_sib(ν_sib) ≡ { φ ∈ R : R_A and R_B are siblings in tree_λ(y′(φ, ν_sib)) },   (10)

we have that

p_sib(y) = 1 − F(|ν_sib^T y|; 0, σ^2 ‖ν_sib‖_2^2, S^λ_sib(ν_sib)) + F(−|ν_sib^T y|; 0, σ^2 ‖ν_sib‖_2^2, S^λ_sib(ν_sib)),   (11)

where F(·; 0, ‖ν‖_2^2 σ^2, S) denotes the cumulative distribution function of the N(0, ‖ν‖_2^2 σ^2) distribution truncated to the set S. Proofs of all theoretical results are provided in the appendix. Theorem 1 says that given the set S^λ_sib(ν_sib), we can compute the p-value in (8) using (11). In Section 4, we provide an efficient approach for analytically characterizing the truncation set S^λ_sib(ν_sib). To avoid numerical issues associated with the truncated normal distribution, we compute (11) using methods described in the supplement of Chen and Bien [2020]. Note that the proof of Theorem 1, and consequently the efficient computation of p_sib(y) discussed in Section 4, relies on the assumption that Y ∼ N_n(µ, σ^2 I_n).
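A hedged R sketch of a computation of the form (11) follows: the two-sided tail probability of a N(0, σ^2‖ν‖_2^2) random variable truncated to a union of intervals, evaluated naively with pnorm(). The truncation set is the illustrative set reported in Figure 2, and the remaining inputs are made-up numbers; the paper itself uses the more numerically stable methods of Chen and Bien [2020], so this is only a sketch.

```r
# P(Z <= t | Z in S) for Z ~ N(0, sd^2) truncated to the union of the rows of S,
# where each row of S is a disjoint interval (lower, upper).
ptruncnorm <- function(t, sd, S) {
  num <- sum(pmax(0, pnorm(pmin(t, S[, 2]), sd = sd) - pnorm(pmin(t, S[, 1]), sd = sd)))
  den <- sum(pnorm(S[, 2], sd = sd) - pnorm(S[, 1], sd = sd))
  num / den
}

# Two-sided p-value of the form (11).
tree_value <- function(nu_t_y, nu_norm2, sigma, S) {
  sd <- sigma * sqrt(nu_norm2)
  1 - ptruncnorm(abs(nu_t_y), sd, S) + ptruncnorm(-abs(nu_t_y), sd, S)
}

# Toy example using the set from Figure 2: S = (-19.8, -1.8) U (0.9, 34.9).
# nu_norm2 = 0.04 corresponds to, e.g., 50 observations in each sibling region.
S <- rbind(c(-19.8, -1.8), c(0.9, 34.9))
tree_value(nu_t_y = 5, nu_norm2 = 0.04, sigma = 5, S = S)
```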
We now consider inverting the test proposed in (8) to construct an equitailed confidence interval for ν_sib^T µ that has (1 − α)-selective coverage (5), in the sense that

pr{ ν_sib^T µ ∈ [L(Y), U(Y)] | R_A and R_B are siblings in tree_λ(Y) } = 1 − α.   (12)

Proposition 1. For any 0 ≤ α ≤ 1 and any realization y ∈ R^n, the values L(y) and U(y) that satisfy

F(ν_sib^T y; L(y), σ^2 ‖ν_sib‖_2^2, S^λ_sib(ν_sib)) = 1 − α/2,   F(ν_sib^T y; U(y), σ^2 ‖ν_sib‖_2^2, S^λ_sib(ν_sib)) = α/2,   (13)

where F(·; θ, ‖ν‖_2^2 σ^2, S) denotes the cumulative distribution function of the N(θ, ‖ν‖_2^2 σ^2) distribution truncated to the set S, yield an interval [L(y), U(y)] that satisfies (12).
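The following sketch computes an interval of the form (13) by root-finding over the mean of the truncated normal distribution, using the same illustrative inputs as above. uniroot() and the ±20 standard deviation search range are simple stand-ins for a more careful root-finding scheme.

```r
# Truncated-normal CDF with a mean parameter: P(Z <= t | Z in S), Z ~ N(mean, sd^2).
ptruncnorm_mean <- function(t, mean, sd, S) {
  num <- sum(pmax(0, pnorm(pmin(t, S[, 2]), mean, sd) - pnorm(pmin(t, S[, 1]), mean, sd)))
  den <- sum(pnorm(S[, 2], mean, sd) - pnorm(S[, 1], mean, sd))
  num / den
}

# Equitailed selective interval: solve F(obs; L) = 1 - alpha/2 and F(obs; U) = alpha/2.
selective_ci <- function(nu_t_y, nu_norm2, sigma, S, alpha = 0.05) {
  sd <- sigma * sqrt(nu_norm2)
  f <- function(theta, target) ptruncnorm_mean(nu_t_y, theta, sd, S) - target
  lo <- uniroot(f, interval = c(nu_t_y - 20 * sd, nu_t_y + 20 * sd), target = 1 - alpha / 2)$root
  hi <- uniroot(f, interval = c(nu_t_y - 20 * sd, nu_t_y + 20 * sd), target = alpha / 2)$root
  c(lower = lo, upper = hi)
}

S <- rbind(c(-19.8, -1.8), c(0.9, 34.9))
selective_ci(nu_t_y = 5, nu_norm2 = 0.04, sigma = 5, S = S)
```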

Inference on a Single Region
Given a single region R_A in a CART tree, we define the contrast vector ν_reg such that

(ν_reg)_i = 1_{(x_i ∈ R_A)} / |{i′ : x_{i′} ∈ R_A}|.   (14)

Then, ν_reg^T µ = Σ_{i: x_i ∈ R_A} µ_i / |{i : x_i ∈ R_A}|, the mean of µ over the observations in R_A. We now consider testing the null hypothesis H_0 : ν_reg^T µ = c for some fixed c. Because our interest in this null hypothesis results from the fact that R_A ∈ tree_λ(y), we must condition on this event in defining the p-value. We define

p_reg(y) ≡ pr_{H_0}{ |ν_reg^T Y − c| ≥ |ν_reg^T y − c| | R_A ∈ tree_λ(Y), P^⊥_{ν_reg} Y = P^⊥_{ν_reg} y },   (15)

and introduce the following theorem.
Theorem 2. The test based on the p-value p_reg(y) in (15) controls the selective Type 1 error rate for H_0 : ν_reg^T µ = c, where ν_reg is defined in (14). Furthermore, p_reg(y) can be computed from the cumulative distribution function of the N(c, σ^2 ‖ν_reg‖_2^2) distribution truncated to the set

S^λ_reg(ν_reg) ≡ { φ ∈ R : R_A ∈ tree_λ(y′(φ, ν_reg)) },   (16)

where y′(φ, ν_reg) is the perturbation defined in Theorem 1. Theorem 2 and the resulting efficient computations in Section 4 rely on the assumption that Y ∼ N_n(µ, σ^2 I_n).
We can also define a confidence interval for ν T reg µ that attains nominal selective coverage.
In Section 4, we propose an approach to analytically characterize the set S λ reg (ν reg ) in (16).
3.3 Intuition for the Conditioning Sets S^λ_sib(ν_sib) and S^λ_reg(ν_reg)

We first develop intuition for the set S^λ_sib(ν_sib) defined in (10). From Theorem 1, y′(φ, ν_sib) is a perturbation of y that exaggerates the difference between the observed sample mean responses of R_A and R_B if |φ| > |ν_sib^T y|, and shrinks that difference if |φ| < |ν_sib^T y|. The set S^λ_sib(ν_sib) quantifies the amount that we can shift the difference in sample mean responses between R_A and R_B while still producing a tree containing these sibling regions.

Figure 2: Top: Output of CART applied to y′(φ, ν_sib), where ν_sib in (6) encodes the contrast between R_A and R_B, for various values of φ. The left-most panel displays y = y′(ν_sib^T y, ν_sib). By inspection, we see that −14.9 ∈ S^0_sib(ν_sib) and 5 ∈ S^0_sib(ν_sib), but 0 ∉ S^0_sib(ν_sib) and 40 ∉ S^0_sib(ν_sib). In fact, S^0_sib(ν_sib) = (−19.8, −1.8) ∪ (0.9, 34.9). Bottom: Output of CART applied to y′(φ, ν_reg), where ν_reg in (14) encodes membership in R_A. The left-most panel displays y = y′(ν_reg^T y, ν_reg).
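The perturbation itself, and a brute-force grid approximation to the conditioning set, are simple to sketch in R. The predicate siblings_present() below is a hypothetical placeholder (one would refit CART and check the tree's structure); in practice the set is characterized analytically, as described in Section 4.

```r
# y'(phi, nu) from Theorem 1: move nu^T y to exactly phi, leave the rest of y alone.
perturb <- function(y, nu, phi) y + (phi - sum(nu * y)) * nu / sum(nu^2)

# Sanity check: the perturbation moves nu^T y to exactly phi.
y  <- rnorm(10)
nu <- c(rep(0.2, 5), rep(-0.2, 5))     # a toy contrast between two groups of 5
sum(nu * perturb(y, nu, phi = 3))      # equals 3

# Gridded approximation to S^lambda_sib(nu_sib): keep the phi values for which
# R_A and R_B are still siblings in the tree fit to y'(phi, nu_sib).
scan_S_sib <- function(y, nu, phi_grid, siblings_present) {
  keep <- vapply(phi_grid, function(phi) siblings_present(perturb(y, nu, phi)), logical(1))
  phi_grid[keep]
}
```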
We next develop intuition for S^λ_reg(ν_reg), defined in (16). Note that {y′(φ, ν_reg)}_i = y_i + (φ − ȳ_{R_A}) 1_{(x_i ∈ R_A)}, where y′(φ, ν_reg) is defined in Theorem 2. Thus, y′(φ, ν_reg) shifts the responses of the observations in R_A so that their sample mean equals φ, and leaves the others unchanged. The set S^λ_reg(ν_reg) quantifies the amount that we can exaggerate or shrink the sample mean response in region R_A while still producing a tree that contains R_A.

4 Computing the conditioning sets S^λ_sib(ν_sib) and S^λ_reg(ν_reg)

Recharacterizing the conditioning sets in terms of branches
We begin by introducing the concept of a branch.
Definition 6 (Branch). A branch is a sequence of splits B = ((j_1, s_1, e_1), . . . , (j_L, s_L, e_L)), with j_l ∈ {1, . . . , p}, s_l ∈ {1, . . . , n − 1}, and e_l ∈ {0, 1} for l = 1, . . . , L. A branch B induces the nested regions R^(l) = ∩_{l′=1}^{l} χ_{j_{l′},s_{l′},e_{l′}} for l = 1, . . . , L, along with R^(0) = R^p; we let R(B) denote this set of regions.

For a branch B and a vector ν, we define

S^λ(B, ν) ≡ { φ ∈ R : R(B) ⊆ tree_λ(y′(φ, ν)) }.   (18)

For R ∈ tree, we let branch(R, tree) denote the branch such that R{branch(R, tree)} contains R and all of its ancestors in tree.
Lemma 1. Suppose that R_A and R_B are siblings in tree_λ(y). Then R_A and R_B are siblings in tree_λ{y′(φ, ν_sib)} if and only if R[branch{R_A, tree_λ(y)}] ⊆ tree_λ{y′(φ, ν_sib)}; consequently, S^λ_sib(ν_sib) = S^λ(branch{R_A, tree_λ(y)}, ν_sib), for the sets defined in (10) and (18).
Lemma 1 says that tree λ {y (φ, ν sib )} contains siblings R A and R B if and only if it contains the entire branch associated with R A in tree λ (y). However, Lemma 1 does not apply in the single region case: for ν reg defined in (14) and some R A ∈ tree λ (y), the fact that R A ∈ tree λ {y (φ, ν reg )} does not imply that R[branch{R A , tree λ (y)}] ⊆ tree λ {y (φ, ν reg )}.
Instead, a result similar to Lemma 1 holds, involving permutations of the branch.
Branch B and its permutation π(B) induce the same region R^(L), but R{π(B)} ≠ R(B).
Lemmas 1 and 2 reveal that computing S^λ_sib(ν_sib) and S^λ_reg(ν_reg) requires characterizing sets of the form S^λ(B, ν), defined in (18). To compute S^λ_sib(ν_sib) we will only need to consider S^λ(B, ν) where R(B) ⊆ tree_λ(y). However, to compute S^λ_reg(ν_reg), we will need to consider (18) for branches whose regions need not all appear in tree_λ(y).

Throughout this section, we consider a vector ν ∈ R^n and a branch B = ((j_1, s_1, e_1), . . . , (j_L, s_L, e_L)), where R(B) may or may not be contained in tree_λ(y). Recall from Definition 6 that B induces the nested regions R^(l) = ∩_{l′=1}^{l} χ_{j_{l′},s_{l′},e_{l′}} for l = 1, . . . , L, and R^(0) = R^p. Throughout this section, our only requirement on B and ν is the following condition (Condition 1).

To characterize S^λ(B, ν) in (18), recall that the CART algorithm in Section 2.2 involves growing a very large tree tree_0(y), and then pruning it. We first characterize the set S_grow(B, ν) ≡ {φ : R(B) ⊆ tree_0(y′(φ, ν))}, the set of φ for which the branch survives the growing step. Proposition 3 expresses S_grow(B, ν) as an intersection of sets S_{l,j,s} defined through the gains gain_{R^(l)}{y′(φ, ν), j, s} in (2), and Proposition 4 says that we can compute S_grow(B, ν) efficiently. Noting that S^λ(B, ν) = {φ ∈ S_grow(B, ν) : R^(L) ∈ tree_λ(y′(φ, ν))}, it remains to characterize the set of φ ∈ S_grow(B, ν) such that R^(L) is not removed during pruning. Recall that g(·) was defined in (3).
We explain how to compute a tree(B, ν, λ) satisfying (21) when R(B) ⊄ tree_λ(y) in the supplementary materials. The results in this section have relied upon Condition 1. Indeed, this condition holds for the branches B and vectors ν that arise in characterizing the sets S^λ_sib(ν_sib) and S^λ_reg(ν_reg).
Combining Lemma 1 with Propositions 3-7, we see that S^λ_sib(ν_sib) can be computed in O{npL + np log(n)} operations. However, computing S^λ_reg(ν_reg) is much more computationally intensive: by Lemma 2 and Propositions 3-7, it requires computing S^λ(π[branch{R_A, tree_λ(y)}], ν_reg) for every permutation π ∈ Π, and is therefore far more costly. In Section 4.3, we discuss ways to avoid these calculations.

A Computationally-Efficient Alternative to S^λ_reg(ν_reg)

Lemma 2 suggests that carrying out inference on a single region requires computing S^λ(π[branch{R_A, tree_λ(y)}], ν_reg) for every π ∈ Π. We now present a less computationally demanding alternative.
The test based on p^Q_reg(y) controls the selective Type 1 error rate (4) for H_0 : ν_reg^T µ = c. Using the notation in Proposition 8, p_reg(y) introduced in (15) equals p^Π_reg(y). If we take Q = {I}, where I is the identity permutation, then we arrive at the p-value p^I_reg(y) (22). In Appendix F, we show through simulation that the loss in power associated with using (22) rather than (15) is negligible. Thus, in practice, we suggest using (22) for its computational efficiency. We use (22) for the remainder of this paper.
Furthermore, we can consider computing confidence intervals of the form [L^{S^I}_reg(y), U^{S^I}_reg(y)] rather than (17), where L^{S^I}_reg(y) and U^{S^I}_reg(y) are defined as in (17) but with the conditioning set corresponding to the identity permutation in place of S^λ_reg(ν_reg) (23). In Appendix F, we show that the confidence intervals resulting from (23) are not much wider than those resulting from (17). We therefore make use of confidence intervals of the form (23) in the remainder of this paper.

Simulation Study

Data Generating Mechanism

We simulate X ∈ R^{n×p} with n = 200, p = 10, X_ij i.i.d. ∼ N(0, 1), and y ∼ N_n(µ, σ^2 I_n). The mean vector µ, which depends on parameters a and b, defines a three-level tree, shown in Figure 3 for three values of a ∈ R.

Methods for Comparison
All CART trees are fit using the R package rpart [Therneau and Atkinson, 2019] with λ = 200, a maximum level of three, and a minimum node size of one; a sketch of these settings in rpart syntax appears after the list of methods below. We compare three approaches for conducting inference. (i) Selective Z-methods: Fit a CART tree to the data. For each split, test for a difference in means between the two sibling regions using (8), and compute the corresponding confidence interval in (13). Compute the confidence interval for the mean of each region using (23). (ii) Naive Z-methods: Fit a CART tree to the data. For each split, conduct a naive Z-test for the difference in means between the two sibling regions, and compute the corresponding naive Z-interval. Compute a naive Z-interval for each region's mean. (iii) Sample splitting: Split the data into equally-sized training and test sets. Fit a CART tree to the training set. On the test set, conduct a naive Z-test for each split and compute a naive Z-interval for each split and each region. If a region has no test set observations, then we fail to reject the null hypothesis and fail to cover the parameter.

Figure 3: The true mean model in Section 5, for a = 0.5 (left), a = 1 (center), and a = 2 (right). The difference in means between the sibling nodes at level two in the tree is ab, while the difference in means between the sibling nodes at level three is b.
The conditional inference tree (CTree) framework of Hothorn et al. [2006] uses a different criterion than CART to perform binary splits. Within a region, it tests for linear association between each covariate and the response. The covariate with the smallest p-value for this linear association is selected as the split variable, and a Bonferroni corrected p-value that accounts for the number of covariates is reported in the final tree. Then, the split point is selected. If, after accounting for multiple testing, no variable has a p-value below a prespecified significance level α, then the recursion stops. While CTree's p-values assess linear association and thus are not directly comparable to the p-values in (i)-(iii) above, it is the most popular framework currently available for determining if a regression tree split is statistically significant. Thus, we also evaluate the performance of (iv) CTree: Fit a CTree to all of the data using the R package partykit [Hothorn and Zeileis, 2015] with α = 0.05.
For each split, record the p-value reported by partykit.
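The rpart settings described above can be written as follows. This is a sketch rather than the authors' code: the translation of λ = 200 into rpart's relative complexity parameter cp (which rpart scales by the root-node error) is approximate, and the mean vector is a placeholder since the exact µ of Section 5.1 is not reproduced here.

```r
library(rpart)

n <- 200; p <- 10
X <- matrix(rnorm(n * p), n, p)
mu <- rep(0, n)                    # placeholder mean; the paper's mu defines a three-level tree
y <- mu + rnorm(n)
dat <- data.frame(y = y, X)

lambda <- 200
fit <- rpart(
  y ~ ., data = dat, method = "anova",
  control = rpart.control(
    cp = lambda / sum((y - mean(y))^2),  # lambda (in SSE units) rescaled to rpart's cp
    maxdepth = 3,                        # maximum level of three
    minsplit = 2, minbucket = 1,         # minimum node size of one
    maxcompete = 0, maxsurrogate = 0, xval = 0
  )
)
print(fit)
```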
In Sections 5.3-5.6, we assume that σ is known. We consider the case of unknown σ in Section 5.7.

Uniform p-values under a Global Null
We generate 5,000 datasets with a = b = 0, so that H_0 : ν_sib^T µ = 0 holds for all splits in all trees. Figure 4 displays the distributions of p-values across all splits in all fitted trees for the naive Z-test, sample splitting, and the selective Z-test. The selective Z-test and sample splitting yield uniformly distributed p-values under the null, while the naive Z-test (which does not account for the fact that ν_sib was obtained by applying CART to the same data used for testing) does not. CTree is omitted from the comparison: it creates a split only if the p-value is less than α = 0.05, and thus its p-values over the splits do not follow a Uniform(0,1) distribution.

Figure 4: Quantile-quantile plots of the p-values for testing H_0 : ν_sib^T µ = 0, as described in Section 5.3. A naive Z-test (green), sample splitting (blue), and selective Z-test (pink) were performed; see Section 5.2. The p-values are stratified by the level of the regions in the fitted tree.
As naive Z-tests do not control the Type 1 error rate (Figure 4), we do not evaluate their power. We consider two aspects of power: the probability that we detect a true split, and the probability that we reject the null hypothesis corresponding to a true split.
Given a true split in Figure 3 and an estimated split, we construct the 3 × 3 contingency table corresponding to the shaded region in Table 1, and compute the adjusted Rand index from this table. For each true split, we identify the estimated split for which the adjusted Rand index is largest; if this index exceeds 0.75 then this true split is "detected". Given that a true split is detected, the associated null hypothesis is rejected if the corresponding p-value is below 0.05. Figure 5 displays the proportion of true splits that are detected and rejected by each method.
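A small sketch of the detection rule follows: compute the adjusted Rand index between the three-group labeling induced by a true split and that induced by an estimated split, and call the split detected if the index exceeds 0.75. The labels and the p-value below are illustrative placeholders, not simulation output.

```r
library(mclust)  # provides adjustedRandIndex()

# Hypothetical three-group memberships (left child, right child, outside the region).
true_labels <- c(rep("left", 80), rep("right", 80), rep("outside", 40))
est_labels  <- c(rep("left", 72), rep("right", 88), rep("outside", 40))

ari <- adjustedRandIndex(true_labels, est_labels)
detected <- ari > 0.75                 # the paper's detection threshold
p_value  <- 0.012                      # placeholder: selective Z-test p-value for this split
rejected <- detected && p_value < 0.05
c(ari = ari, detected = detected, rejected = rejected)
```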
As sample splitting fits a tree using only half of the data, it detects fewer true splits, and thus rejects the null hypothesis for fewer true splits, than the selective Z-test.
When a is small, the difference in means between sibling regions at level two is small.
Because CTree makes a split only if there is strong evidence of association at that level, it tends to build one-level trees, and thus fails to detect many true splits; by contrast, the selective Z-test (based on CART) successfully builds more three-level trees. Thus, when a is small, the selective Z-test detects (and rejects) more true differences than CTree between regions at levels two and three.

Figure 5: Proportion of true splits detected (solid lines) and rejected (dotted lines) for CART with selective Z-tests (pink), CTree (black), and CART with sample splitting (blue) across different settings of the data generating mechanism, stratified by level in the tree. As CTree only makes a split if the p-value is less than 0.05, the proportion of detections equals the proportion of rejections.
5.5 Coverage of Confidence Intervals for ν_sib^T µ and ν_reg^T µ

We generate 500 datasets for each (a, b) ∈ {0.5, 1, 2} × {0, . . . , 10} to evaluate the coverage of 95% confidence intervals constructed using naive Z-methods, selective Z-methods, and sample splitting. CTree is omitted from these comparisons because it does not provide confidence intervals. We say that the interval covers the truth if it contains ν^T µ, where ν is defined as in (6) (for a particular split) or (14) (for a particular region). Table 2 shows the proportion of each type of interval that covers the truth, aggregated across values of a and b. The selective Z-intervals attain the nominal coverage of 95%, while the naive Z-intervals do not.
It may come as a surprise that sample splitting does not attain correct coverage. Recall that ν from (6) or (14) is an n-vector that contains entries for all observations in both the training set and the test set. Thus, ν^T µ involves the true mean among both training and test set observations, whereas the sample splitting intervals are constructed from the test set observations alone, and therefore need not cover ν^T µ.

Width of Confidence Intervals
Figure 6(a) illustrates that our selective Z-intervals for ν T reg µ can be extremely wide when b is small, particularly for regions located at deeper levels in the tree. For each tree that we build and for levels 1, 2, and 3, we compute the adjusted Rand Index [Hubert and Arabie, 1985] between the true tree (truncated at the appropriate level) and the estimated tree (truncated at the same level). Figure 6(b) shows that our selective confidence intervals can be extremely wide when this adjusted Rand Index is small, particularly at deeper levels of the tree.
When b is small and the adjusted Rand index is small, the trees built by CART tend to be unstable, in the sense that small perturbations to the data affect the fitted tree. In this setting, the sample statistics ν_reg^T y fall very close to the boundary of the truncation set. See Kivaranovic and Leeb [2021] for a discussion of why wide confidence intervals can arise in such settings. The considerable width of our confidence intervals reflects the uncertainty about the mean response within each region due to the instability of the tree-fitting procedure.

Results with Unknown σ
Thus far, we have assumed that σ is known. In this section, we compare the following three versions of the selective Z-methods, which plug different values of σ into the truncated normal CDF when computing p-values and confidence intervals: 1. σ: we plug in the true value of σ, as in Sections 5.3-5.6. 2. σ̂_cons: we plug in a conservative estimate of σ, which tends to over-estimate σ in the presence of signal. 3. σ̂_SSE: we plug in an estimate of σ based on the sum of squared errors of the fitted tree. However, as shown in Figure 8, selective Z-tests based on σ̂_cons can have very low power. One promising avenue of future work involves providing theoretical guarantees in the regression tree setting for estimators that are less conservative than σ̂_cons.

Application to the Box Lunch Study

Venkatasubramaniam et al. [2017] compared CART and CTree on data from the Box Lunch Study and preferred CTree because it provides p-values for each split, even though CART has higher predictive accuracy. Since our framework provides p-values for each split in a CART tree, we revisit their analysis of the Box Lunch Study, a clinical trial studying the impact of portion control interventions on 24-hour caloric intake. We consider identifying subgroups of study participants with baseline differences in 24-hour caloric intake on the basis of scores from an assessment that quantifies constructs such as hunger, liking, the relative reinforcement of food (rrvfood), and restraint (resteating).
We exactly reproduce the trees presented in Figures 1 and 2 of Venkatasubramaniam et al. [2017]; the resulting CART tree, annotated with our selective p-values and confidence intervals, appears in Figure 9. We see from the bottom right of Figure 9 that while there is evidence of a linear relationship between hunger and caloric intake, there is less evidence of a difference in means across the particular split hunger=1.8. Given that the goal of Venkatasubramaniam et al. [2017] is to "identify population subgroups that are relatively homogeneous with respect to an outcome", the p-value resulting from our selective framework is more natural than the p-value output by CTree, since the former relates directly to the subgroups formed by the split, whereas the latter does not take into account the location of the split point. In general, the left-hand panel of Figure 9 shows that the subgroups of patients identified by CART are not significantly different from one another. This is an important finding that would be missed without our selective inference framework. Furthermore, unlike CTree, our framework provides confidence intervals for the mean response in each subgroup.
An alternative analysis usingσ cons , defined in Section 5.7, is provided in Appendix H, and leads to similar findings.

Discussion
Our framework relies on the assumption that Y ∼ N n (µ, σ 2 I), with σ 2 known. In Section 5.7, we showed strong empirical performance when the variance is unknown and σ 2 is estimated.
In this section, we briefly comment on the assumptions of spherical variance and normally distributed data.
It is natural to wonder whether the assumption that Y ∼ N_n(µ, σ^2 I) can be relaxed to the assumption that Y ∼ N_n(µ, Σ), with Σ known. Following the work of Lee et al. [2016], the results in Section 3 extend to the setting where Y ∼ N_n(µ, Σ) if we: 1. Modify (8) and (15) to condition on the event that the component of Y that is uncorrelated with ν^T Y equals its observed value, where ν = ν_sib in the case of (8) and ν = ν_reg in the case of (15).
2. Replace all instances of the perturbation y′(φ, ν), defined in Theorem 1, with the modified perturbation y′(φ, ν) = y + (φ − ν^T y) Σν / (ν^T Σν).

Unfortunately, the modified perturbation y′(φ, ν) does not satisfy Condition 1 in Section 4.2 when Σ ≠ σ^2 I_n, and so many of the results of Section 4 do not extend to this non-spherical setting. Future work could explore how to efficiently compute the conditioning set in this non-spherical setting.
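The modified perturbation written above is the natural Lee et al. [2016]-style generalization; assuming that form, the following small R check verifies numerically that ν^T y′(φ, ν) = φ and that the component of y uncorrelated with ν^T Y is left unchanged.

```r
set.seed(1)
n <- 6
A <- matrix(rnorm(n * n), n)
Sigma <- crossprod(A) + diag(n)                 # a known, non-spherical covariance
y  <- rnorm(n)
nu <- c(1, -1, rep(0, n - 2))
phi <- 2.5

# Assumed generalization: y'(phi, nu) = y + (phi - nu^T y) * Sigma nu / (nu^T Sigma nu).
dir <- drop(Sigma %*% nu) / drop(t(nu) %*% Sigma %*% nu)
y_perturbed <- y + (phi - sum(nu * y)) * dir

sum(nu * y_perturbed)                           # equals phi
decorrelate <- function(v) v - sum(nu * v) * dir  # removes the part correlated with nu^T Y
all.equal(decorrelate(y), decorrelate(y_perturbed))  # TRUE: nuisance component unchanged
```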
Furthermore, our framework assumes a normally-distributed response variable. CART is commonly used for classification, survival [Segal, 1988], and treatment effect estimation in causal inference [Athey and Imbens, 2016]. While the idea of conditioning on a selection event to control the selective Type 1 error rate applies regardless of the distribution of the response, our Theorem 1 and Theorem 2, and the resulting computational results, relied on normality of Y . In the absence of this assumption, exactly characterizing the conditioning set and the distribution of the test statistic requires further investigation.
We show in Appendix G that our selective Z-tests approximately control the selective Type 1 error rate under several non-normal response distributions. A reviewer pointed out similarities between the problem of testing significance of the first split in the tree and significance testing for a single changepoint, as in Bhattacharya [1994].
Building on this connection may provide an avenue for future work.
A software implementation of the methods in this paper is available in the R package treevalues, at https://github.com/anna-neufeld/treevalues.

A The Bootstrap Calibration Approach of Loh et al. [2016, 2019]

The bootstrap calibration approach of Loh et al. [2019], instantiated to our setting, is as follows. Suppose that R_A ∈ tree_λ(y), and define the vector ν_reg such that (ν_reg)_i = 1_{(x_i ∈ R_A)} / |{i′ : x_{i′} ∈ R_A}|, as in (14). We know that the "naive" Z-interval does not achieve nominal coverage, meaning that pr{ ν_reg^T µ ∈ [ν_reg^T y − z_{α/2} σ‖ν_reg‖_2, ν_reg^T y + z_{α/2} σ‖ν_reg‖_2] } may fall well below 1 − α. This is because the "multiplier" for the naive confidence interval, z_{α/2}, is derived under the assumption that the region R_A (or, equivalently, the vector ν_reg) is fixed, rather than a function of the data.
Loh et al. [2019] observe that there exists some α′ < α such that the naive Z-interval constructed at level 1 − α′ attains (1 − α) coverage (25). The value of α′ for a given tree will depend on the number of split covariates p, the number of data points n, the depth of the tree, and the value of λ used for tree pruning, among other considerations. If we know how the data were generated, then we can check whether some value α′ satisfies (25) as follows: 1. Draw B different simulated datasets {(X_b, y_b)}_{b=1}^{B} from the same distribution (call this F) as the original data. For b = 1, . . . , B: (a) Build a tree using the simulated data (X_b, y_b), using the same procedure and the same settings as for the original data, and denote it tree_λ(y_b).
(b) For each terminal region R ∈ term{tree_λ(y_b), R^p} in the tree: i. Construct a (1 − α′) naive Z-interval for the mean response in the region using (X_b, y_b).
ii. Check if each interval contains μ̄_R, the true mean for this region R.
2. Compute the fraction of intervals in 1(b) that contain μ̄_R.
3. If this value is 1 − α, then we have found the correct value of α′. If not, then we try a larger or smaller value of α′.
We test this procedure in a very simple simulation study; a sketch of the check appears below. We generate data X_ij ∼ N(0, 1) and y_i ∼ N(0, 1) for i = 1, . . . , 100 and j = 1, . . . , p. In this simple setting, the true mean response for every region in every fitted tree is 0. We carry out the procedure outlined above with α = 0.1. For two values of p and for CART trees with 1, 2, and 3 levels, we create 1000 datasets and 1000 trees and report the empirical coverage of the intervals obtained using this ideal method, averaged over all nodes in all trees. The results, shown in Table 4, show that this ideal procedure leads to intervals that achieve nominal coverage.
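A hedged R sketch of the ideal calibration check, under the pure-noise setting described above (so that the true mean in every region is 0): for a candidate α′, it reports the empirical coverage of the naive Z-intervals across regions and replications. The specific rpart control settings are illustrative choices, not those of Loh et al.

```r
library(rpart)

ideal_coverage <- function(alpha_prime, B = 200, n = 100, p = 2, sigma = 1, maxdepth = 2) {
  z <- qnorm(1 - alpha_prime / 2)
  covered <- c()
  for (b in seq_len(B)) {
    X <- matrix(rnorm(n * p), n, p)
    y <- rnorm(n, sd = sigma)                # pure noise: every region's true mean is 0
    dat <- data.frame(y = y, X)
    fit <- rpart(y ~ ., data = dat, method = "anova",
                 control = rpart.control(maxdepth = maxdepth, cp = 0, xval = 0,
                                         minsplit = 2, minbucket = 1))
    leaf <- fit$where                        # terminal region of each observation
    for (r in unique(leaf)) {
      yr <- y[leaf == r]
      half <- z * sigma / sqrt(length(yr))   # naive (1 - alpha') Z-interval half-width
      covered <- c(covered, abs(mean(yr) - 0) <= half)
    }
  }
  mean(covered)
}

# Calibrate: search over alpha_prime until ideal_coverage() is about 1 - alpha.
ideal_coverage(alpha_prime = 0.02)
```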
Unfortunately, this ideal procedure is practically infeasible, as it requires the user to know the true distribution of the data, F. Thus, in practice, Loh et al. [2019] propose replacing F by F̂, the empirical distribution of the original data. This amounts to replacing the simulated datasets in Step 1 with bootstrapped datasets, and checking whether the naive Z-intervals in Step 1(b) contain ȳ_R rather than μ̄_R.
We can see why this is problematic in a very simple setting where we fit a tree with depth 1. If all observations have mean 0, then a CART tree fit to a bootstrap sample of the data will nevertheless find regions R L and R R such that the sample mean value of y b within R L is negative and the sample mean value of y b within R R is positive. As y b and y contain many overlapping observations, it is likely that the sample mean value of y within R L is also negative and the sample mean value of y within R R is also positive. In other words, because of the overlap between y and y b , the within-region sample means of y b are closer to the within-region sample means of y than they are to the within-region population means.
Thus, when we calibrate α′ to cover the mean values of y within various regions, we end up with under-coverage of the true population mean, as shown in Table 4.
We see in Table 4 that this is indeed the case: the bootstrap calibration of Loh et al. [2019] under-covers the true region means, whereas the ideal procedure does not.

Table 4: Coverage of 90% confidence intervals computed using three methods for the simple setting where y_i ∼ N(0, 1) and X_ij ∼ N(0, 1) for i = 1, . . . , 100 and j = 1, . . . , p. Note that the "Loh (ideal)" method can never be used in practice, as it requires knowledge of the true parameter.

B Proofs for Section 3

B.1 Proof of Theorem 1
Let 0 ≤ α ≤ 1. We start by proving the first statement in Theorem 1: This is a special case of Proposition 3 from Fithian et al. [2014]. It follows from the definition of p sib (Y ) in (8) that Therefore, applying the law of total expectation yields The second statement of Theorem 1 follows directly from the following result.
Lemma 3. If Y ∼ N_n(µ, σ^2 I_n), then, conditional on the event {R_A and R_B are siblings in tree_λ(Y), P^⊥_{ν_sib} Y = P^⊥_{ν_sib} y}, the random variable ν_sib^T Y follows a N(ν_sib^T µ, σ^2 ‖ν_sib‖_2^2) distribution truncated to the set S^λ_sib(ν_sib) defined in (10).   (26)

B.2 Proof of Proposition 1
Theorem 6.1 from Lee et al. [2016] says that the truncated normal distribution has monotone likelihood ratio in the mean parameter. This guarantees that L(y) and U(y) in (13) are unique. Then, for L(·) and U(·) in (13), (26) in Lemma 3 guarantees that (27) holds. Finally, we need to prove that (27) implies (1 − α)-selective coverage as defined in (12).
Following Proposition 3 from Fithian et al. [2014], let η be the random variable P ⊥ ν Y and let f (·) be its density. Then,

B.3 Proof of Theorem 2
We omit the proof of the first statement of Theorem 2, as it is similar to the proof of the first statement of Theorem 1 in Appendix B.1.
The second statement in Theorem 2 follows directly from the following result.
Lemma 4. Conditional on the event {R_A ∈ tree_λ(Y), P^⊥_{ν_reg} Y = P^⊥_{ν_reg} y}, the random variable ν_reg^T Y follows a N(ν_reg^T µ, σ^2 ‖ν_reg‖_2^2) distribution truncated to the set S^λ_reg(ν_reg), which was defined in (16).   (28)
We omit the proof of Lemma 4, as it is similar to the proof of Lemma 3.

B.4 Proof of Proposition 2
The proof largely follows the proof of Proposition 1. The fact that the truncated normal distribution has monotone likelihood ratio (Theorem 6.1 of Lee et al. 2016) ensures that L(y) and U(y) defined in (17) are unique, and (28) in Lemma 4 implies the analogue of (27). The rest of the argument is as in the proof of Proposition 1.
C Proofs for Section 4.1

C.1 Proof of Lemma 1
We first state and prove the following lemma.
Lemma 5. Let R A and R B be the regions in the definition of ν sib in (6). For an arbitrary region R and for any j ∈ {1, . . . , p} and s ∈ {1, . . . , n − 1}, recall that the potential children of R (the ones that CART will consider adding to the tree when applying Algorithm A1 to region R) are given by R ∩ χ j,s,0 and R ∩ χ j,s,1 , where χ j,s,0 and χ j,s,l were defined in (1). If Proof. It follows from algebra that for Gain R (y, j, s) defined in (2), . It follows from (29) that to prove Lemma 5, it suffices to show that y T = y (φ, ν sib ) T for T ∈ {R, R∩χ j,s,0 , R∩χ j,s,1 }. Recall from Section 3.3 Without loss of generality, assume that ( 1 (x i ∈R∩χ j,s,1 ) i∈R∩χ j,s,1 (y i +0) = y R∩χ j,s,1 .
We will now prove Lemma 1.

D.2 Proof of Proposition 4
Given a region R, let 1(R) denote the vector in R^n whose ith element is 1_{(x_i ∈ R)}. Let P_{1(R)} = 1(R) {1(R)^T 1(R)}^{-1} 1(R)^T denote the orthogonal projection matrix onto the vector 1(R).
Furthermore, the matrix M R,j,s is positive semidefinite.
We now use (34) to prove the first statement of Proposition 4.
Lemma 7. Each set S l,j,s is defined by a quadratic inequality in φ.
Proof. Using the definitions in Lemmas 6 and 7 and algebra, each set S_{l,j,s} can be expressed as the solution set of a quadratic inequality in φ. We compute the scalar ‖ν‖_2^2 and the vector P^⊥_ν y in O(n) operations once at the start of the algorithm. We also sort each feature in O{n log(n)} operations per feature. We will now show that for the lth level and the jth feature, we can compute a{R^(l−1), j, s}, b{R^(l−1), j, s}, and c{R^(l−1), j, s} for all n − 1 values of s in O(n) operations.
Thus, we can obtain all components of coefficients a{R (l−1) , j, s}, b{R (l−1) , j, s}, and c{R (l−1) , j, s} for a fixed j and l, and for all s = 1, . . . , n − 1, in O(n) operations. These scalar components can be combined to obtain the coefficients in O(n) operations. Therefore, given the sorted features, we compute the (n − 1)pL coefficients in O(npL) operations.
Once the coefficients on the right-hand side of (36) have been computed, we can compute S_{l,j,s} in constant time via the quadratic formula: it is either a single interval or the union of two intervals. Finally, in general we can intersect (n − 1)pL intervals in O{npL log(npL)} operations [Bourgon, 2009]. The final claim of Proposition 4 involves the special case where ν = ν_sib and B = branch{R_A, tree_λ(y)}.
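A generic R sketch of the computations just described follows: solve a quadratic inequality in φ to obtain a single interval or a union of two rays, and intersect unions of intervals. The inequality direction and the brute-force pairwise intersection are illustrative simplifications; the actual implementation sorts endpoints to achieve the stated cost.

```r
# Solution set of a*phi^2 + b*phi + c <= 0, returned as a matrix of intervals (rows).
solve_quadratic_leq <- function(a, b, c) {
  if (abs(a) < 1e-12) {                                   # effectively linear
    if (abs(b) < 1e-12) return(if (c <= 0) rbind(c(-Inf, Inf)) else NULL)
    return(if (b > 0) rbind(c(-Inf, -c / b)) else rbind(c(-c / b, Inf)))
  }
  disc <- b^2 - 4 * a * c
  if (disc < 0) return(if (a < 0) rbind(c(-Inf, Inf)) else NULL)
  r <- sort(c((-b - sqrt(disc)) / (2 * a), (-b + sqrt(disc)) / (2 * a)))
  if (a > 0) rbind(r) else rbind(c(-Inf, r[1]), c(r[2], Inf))   # opens up vs. down
}

# Pairwise intersection of two unions of intervals (each a matrix with rows (lo, hi)).
intersect_intervals <- function(A, B) {
  if (is.null(A) || is.null(B)) return(NULL)
  out <- NULL
  for (i in seq_len(nrow(A))) for (j in seq_len(nrow(B))) {
    lo <- max(A[i, 1], B[j, 1]); hi <- min(A[i, 2], B[j, 2])
    if (lo <= hi) out <- rbind(out, c(lo, hi))
  }
  out
}

S1 <- solve_quadratic_leq(-1, 2, 3)    # opens downward: two rays
S2 <- solve_quadratic_leq(1, 0, -4)    # opens upward: [-2, 2]
intersect_intervals(S1, S2)
```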
The first equality follows directly from the definition of a(R, j, s) in (35). To see why the second equality holds, observe that R_A ∪ R_B ⊆ R^(l−1), and without loss of generality assume that R_A ∪ R_B ⊆ R^(l−1) ∩ χ_{j_l,s_l,1}. Recall that the ith element of ν_sib is non-zero if and only if x_i ∈ R_A ∪ R_B, and that the non-zero elements of ν_sib sum to 0. Thus, 1{R^(l−1)}^T ν_sib = 0 and 1{R^(l−1) ∩ χ_{j_l,s_l,1}}^T ν_sib = 0. Furthermore, the supports of 1{R^(l−1) ∩ χ_{j_l,s_l,0}} and ν_sib are non-overlapping, and so 1{R^(l−1) ∩ χ_{j_l,s_l,0}}^T ν_sib = 0. Thus, when l < L, S_{l,j,s} is defined in (36) by a quadratic inequality with a non-negative quadratic coefficient, and so S_{l,j,s} must be a single interval of the form (a, b). Furthermore, since B = branch{R_A, tree_λ(y)}, we know that R(B) ⊆ tree_0(y) = tree_0{y′(ν^T y, ν)}.
To prove (ii), we first prove that when l = L the quadratic in φ defined in (36) has a non-positive quadratic coefficient. To see this, note that the quadratic coefficient can be written as in (40). The first equality follows from (35). The second follows from the definition of M_{R,j,s} given in (33) and from the fact that P_{1{R^(L−1)}} ν_sib = 0, because 1{R^(L−1)}^T ν_sib sums up all of the non-zero elements of ν_sib, which sum to 0. The third equality follows because ν_sib lies in span[1{R^(L−1) ∩ χ_{j_L,s_L,1}}, 1{R^(L−1) ∩ χ_{j_L,s_L,0}}]; projecting it onto this span yields itself.
Noting that P_{1{R^(L−1) ∩ χ_{j,s,1}}} + P_{1{R^(L−1) ∩ χ_{j,s,0}}} is itself a projection matrix, the fourth equality follows from the idempotence of projection matrices, and the inequality follows from the fact that ‖ν_sib‖_2 ≥ ‖Qν_sib‖_2 for any projection matrix Q. Thus, when l = L, the quadratic that defines S_{l,j,s} has a non-positive quadratic coefficient.
Equality can be attained in (40); we now proceed to the setting where the inequality in (40) is strict. In this case, (36) implies that S_{L,j,s} = (−∞, c] ∪ [d, ∞) for c ≤ d and c, d ∈ R. To complete the proof of (ii), we must argue that c ≤ 0 and d ≥ 0. Recall that the quadratic in (34) has the form gain_{R^(L−1)}{y′(φ, ν), j, s} − gain_{R^(L−1)}{y′(φ, ν), j_L, s_L}. When φ = 0, gain_{R^(L−1)}{y′(φ, ν), j_L, s_L} = 0, because φ = 0 eliminates the contrast between R_A and R_B, so that the split on (j_L, s_L) provides zero gain. So, when φ = 0, the quadratic evaluates to gain_{R^(L−1)}{y′(φ, ν), j, s}, which is non-negative by Lemma 6. Thus, S_{L,j,s} is defined by a downward-facing quadratic that is non-negative when φ = 0, and so the set S_{L,j,s} has the form (−∞, c] ∪ [d, ∞) with c ≤ 0 ≤ d. To prove (iii), observe that (i) implies that

D.3 Proof of Proposition 5
To prove Proposition 5, we first propose a particular method of constructing an example of tree(B, ν, λ). We then show that (21) holds for this particular choice for tree (B, ν, λ). We conclude by evaluating the computational cost of computing such an example of tree(B, ν, λ), and by arguing that in the special case where R(B) ∈ tree λ (y), our example is equal to tree λ (y).
When Algorithm A2 is called with parameters tree, y, λ, O, where O is a bottom-up ordering of the K nodes in tree, it computes a sequence of intermediate trees, tree 0 , . . . , tree K .
We use the notation tree k (tree, y, λ, O), for k = 0, . . . , K, to denote the kth of these intermediate trees. The following lemma helps build up to our proposed example of tree(B, ν, λ).
We next prove by induction that, if we choose a bottom-up ordering O that places the regions in R(B) at the end of the ordering, then for k = 0, . . . , K −L. It follows immediately from Algorithm A2 and the argument above that and denote this tree with tree k−1 for brevity. We must prove that (41)  Since k ≤ K − L, this implies that R ∈ {R (L−1) , . . . , R (0) }. This means that either sib , tree 0 {y (φ 1 , ν)} for l ∈ {1, . . . , L} or R ∈ desc R (L) , tree 0 {y (φ 1 , ν)} , meaning that R is a black region in Figure 10(b). From Condition 1, In any of the three cases illustrated in Figure 10, for g(·) defined in (3), g{R, tree k−1 , y (φ 1 , ν)} = g{R, tree k−1 , y (φ 2 , ν)}. Combining this with (42) and Step 2(b) of Algorithm A2 yields (41). This completes the proof by induction.
Since Lemma 10 guarantees that each φ ∈ S grow (B, ν) leads to the same tree 0 {y (φ, ν)}, we will refer to this tree as tree 0 , will let K be the number of regions in this tree, and will let O be a bottom-up ordering of these regions that places R (L−1) , . . . , R (0) in the last L spots. We will further denote tree K−L {tree 0 , y (φ, ν), λ, O} for any φ ∈ S grow (B, ν) as tree K−L , since Lemma 10 further tells us that this is the same for all φ ∈ S grow (B, ν). In what follows, we argue that if we let tree(B, ν, λ) = tree K−L , where tree(B, ν, λ) appears in the statement of Proposition 5, then (21) holds. In other words, we prove that tree K−L , which always exists and is well-defined, is a valid example of tree(B, ν, λ).
So we can rewrite S λ (B, ν) as Furthermore, since R (L−1) , . . . , R (0) (all of which are ancestors of R (L) ) are the last L nodes in the ordering O, we see that R (L) ∈ tree K {tree 0 , y (φ, ν), λ, O} if and only if no pruning occurs during the last L iterations of Step 2 in Algorithm A2. This means that we can characterize (43) as Recall that for k = K − L + 1, . . . , K, R (K−k) is the kth region in O, and is an ancestor of R (L) . We next argue that we can rewrite (44) as To begin, suppose that φ ∈ (45). As we are talking about a particular φ, for k = 0, . . . , K we will suppress the dependence of tree k {tree 0 , y (φ, ν), λ, O} on its arguments and denote it with tree k . The fact that φ ∈ (45) means that g R (L−1) , tree K−L , y (φ, ν) ≥ λ, which ensures that no pruning occurs at step K − L + 1, which in turn ensures that tree K−L+1 = tree K−L . Combined with (45), this implies that g R (L−2) , tree K−L , y (φ, ν) = g R (L−2) , tree K−L+1 , y (φ, ν) ≥ λ, which ensures that no pruning occurs at step K −L+2, which in turn ensures that tree K−L+2 = tree K−L+1 = tree K−L . Proceeding in this manner, by tracing through the last L iterations of Step 2 of Algorithm A2, we see that φ satisfies tree K = tree K−L , and so φ ∈ (44).
Finally, Proposition 5 rewrites (45) with the indexing over k changed to an indexing over l, and plugging in tree(B, ν, λ) = tree K−L . Therefore, tree K−L is a valid example of tree(B, ν, λ).
In the special case that R(B) ⊆ tree λ (y), we have that ν T y ∈ S grow (B, ν) because y = y (ν T y, ν) and R(B) ⊆ tree λ (y) ⊆ tree 0 (y). In this case, suppose that we carry out the process described in the previous paragraph by selecting φ = ν T y. As R(B) ⊆ tree λ (y), it is clear from Algorithm A2 that no pruning occurs during the last L iterations of Step 2 in Algorithm A2 applied to arguments tree 0 (y), y, λ, and O. Therefore, the process from the previous paragraph returns the optimally pruned tree λ (y). Thus, when R(B) ⊆ tree λ (y), we can simply plug in tree λ (y), which has already been built, for tree(B, ν, λ) in (21).

D.4 Proof of Proposition 6
We first show that we can express as the solution set of a quadratic inequality in φ for l = 0, . . . , L − 1, where g(·) was defined in (3). Because only the numerator of g{R (l) , tree(B, ν, λ), y (φ, ν)} depends on φ, it will be useful to introduce the following concise notation: We begin with the following lemma.
Lemma 12 follows from Lemma 11 and the fact that, due to the form of the vector ν, there are many regions R for which h {R, tree(B, ν, λ), y (φ, ν)} does not depend on φ.
We can now write (46) as φ : g R (l) , tree(B, ν, λ), y (φ, ν) ≥ λ = φ : h R (l) , tree(B, ν, λ), y (φ, ν) ≥ λ | term{R (l) , tree(B, ν, λ)}| − 1 = φ : where the functions a(·), b(·), and c(·) were defined in (35) in Appendix D.2, and where is a constant that does not depend on φ. The first equality simply applies the definitions of h(·) and g(·). The second equality follows from Lemma 12 and moving terms that do not depend on φ to the right-hand-side. The third equality follows from plugging in notation from Appendix D.2 and defining the constant γ l for convenience. Thus, (46) is quadratic inequality in φ. We now just need to argue that its coefficients can be obtained efficiently.
We have now seen that the coefficients needed to express (46) as a quadratic inequality in φ can be obtained efficiently. Recall from Proposition 3 that S_grow(B, ν) is the intersection of O(npL) quadratic sets.
The following lemma explains why, similar to Proposition 4, computation time can be reduced in the special case where ν = ν sib and B = branch{R A , tree λ (y)}.
Intersecting L sets of the form (−∞, a l ) ∪ (b l , ∞) for a l ≤ 0 ≤ b l only takes O(L) operations, because we simply need to identify the minimum a l and maximum b l . Finally, Lemma 9 ensures that S grow (B, ν sib ) has at most two disjoint intervals, and so the final intersection with S grow (B, ν sib ) takes only O(1) operations.
Let B = π(branch{R_A, tree_λ(y)}) and let ν = ν_reg as in (14). Note that π(branch{R_A, tree_λ(y)}) induces the same region R^(L) as the unpermuted branch{R_A, tree_λ(y)}. Regardless of the permutation, the induced R^(L) is equal to R_A. Applying the expression for {y′(φ, ν_reg)}_i given in Section 3.3, Condition 1 holds with constant c_2 = 0. First, we will show that the test based on p^Q_reg(y) controls the selective Type 1 error rate, defined in (4); this is a special case of Proposition 3 from Fithian et al. [2014].
F Effect of Using the Computationally Efficient Alternative in Section 4.3

In this section, we investigate the effect of using p^I_reg(y) from (22) rather than p_reg(y) from (15) on power. We also investigate the effect of using (23) rather than (17) on the width of confidence intervals for ν_reg^T µ.
We generate data as described in Section 5.1, but for simplicity we restrict our attention to the case where a = 1 (corresponding to the center panel of Figure 3).
For each tree that we build, we consider (a) testing H 0 : ν T reg µ = 0 and (b) constructing a confidence interval for ν T reg µ, for each region appearing at the third level of the tree. For each test, we compare the test that uses the full conditioning set (i.e. that uses p reg (y) from (15)) to the test that uses the identity permutation only (i.e. that uses p I reg (y) from (22)).
For each interval, we compare the method that uses the full conditioning set (i.e. (17)) to the method that uses the identity permutation only (i.e. (23)). The results are displayed in Figure 11.
The left panel of Figure 11 shows that the power loss resulting from using (22) instead of (15) is negligible. In fact, we need to zoom in on the left panel, as shown in the center panel, to see any separation between the power curves. We see in the center panel that power is lower when (22) is used, though we emphasize that the differences in power are extremely small.
We see in the right panel of Figure 11 that the computationally efficient conditioning set has a more noticeable impact on the median width of our confidence intervals. As expected, the confidence intervals are narrower when we use the full conditioning set (i.e. (17)). However, this difference is most noticeable when b is small, in which case the confidence intervals are wide even when the full conditioning set is used. Thus, the amount of precision lost overall by constructing confidence intervals using the identity permutation only (i.e. (23)) is not of practical importance.
Based on these results, we recommend using the identity permutation in practice, because it is the most computationally efficient choice and does not meaningfully reduce power or precision compared to the full conditioning set.

G Robustness to Non-Normality
In this section, we explore the performance of the selective Z-test under a global null when the normality assumption on Y is violated. For four choices of cumulative distribution function F, we generate Y_i i.i.d. ∼ F such that all observations have the same expected value. We set n = 200, p = 10, and we grow trees to a maximum depth of 3. We plug in (n − 1)^{-1} Σ_{i=1}^{n} (y_i − ȳ)^2 as an estimate of σ^2, as it does not make sense to assume known variance for distributions with a mean-variance relationship. Figure 12 displays quantile-quantile plots of the p-values for testing H_0 : ν_sib^T µ = 0, using the test in (8). Figure 12 shows that, despite the fact that our proposed selective inference framework was derived under a normality assumption, it yields approximately uniformly distributed p-values for Poisson(10), Bernoulli(0.5), and Gamma(1,10) data. We suspect that this is because P^⊥_ν Y is approximately independent of ν^T Y for these distributions (see the proof of Theorem 1 in Appendix B). In the case of Bernoulli(0.1) data, which represents a particularly extreme violation of the normality assumption, the p-values from our selective Z-test are not uniformly distributed. As mentioned in Section 7, future work could involve characterizing conditions on F under which our selective Z-tests approximately control the selective Type 1 error rate.

H Alternative Analysis of the Box Lunch Study

Figure 13 is the same as the left panel of Figure 9, but the selective Z-inference is carried out with σ̂_cons, from Section 5.7, rather than σ̂_SSE. The takeaways presented in Section 6 do not change.

Figure 12: Quantile-quantile plots of the p-values for testing H_0 : ν_sib^T µ = 0 under a global null. A naive Z-test (green), sample splitting (blue), and selective Z-test (pink) were performed; see Section 5.2.

Figure 13: A CART tree fit to the Box Lunch Study data. Each split has been labeled with a p-value (8), and each region has been labeled with a confidence interval (23). Inference is carried out by plugging in σ̂_cons, from Section 5.7, as an estimate of σ.